\documentclass[journal]{IEEEtran}

\usepackage{color}
\usepackage{listings}
\usepackage{verbatim}
\usepackage{graphicx}
\usepackage{multicol}
\usepackage{subfig}
\usepackage{mdwlist}
\usepackage[pagebackref=true]{hyperref}%

\hyphenation{hetero-geneous op-tical net-works semi-conduc-tor}


\begin{document}
%
\title{A Case for Scaling Applications to Many-core with OS Clustering}
\author{
(reviewed by Oleg Iegorov and Jander Nascimento)}
% type author(s) between braces
\maketitle

%\section{Introduction}
This paper proposes a new way to scale operating systems for many cores
--- \emph{OS clustering}. 
%The need to scale an OS in many-core environments
%stems from the fact, that with the increasing number of cores, OS should
%efficiently share them while avoiding data contention. 
The need to scale
stems from the fact, that with the increasing number of cores in
contemporary systems, OS should
efficiently share these cores between running applications, while avoiding data contention. 
The proposed
system called \emph{Cerberus} clusters several commodity operating systems atop
a VMM and provides to applications the traditional shared memory
interface.  Cerberus adds to VMM the support for resource sharing and
communication between operating systems (running as VMs), as well as implementing system
call routing to provide an illusion for applications as running on a
single operating system.  The motivation of this approach is that
commodity operating systems (Linux, FreeBSD, etc.) can scale well with a
small number of CPU cores, while one VMM can efficiently consolidate
multiple operating systems. Thus Cerberus's approach uses several
operating systems to serve one application, without requiring any
modification of existing parallel programs.

The problem of scaling operating systems on shared memory multicore and
multiprocessor machines is not new, and there basically exist two main
approaches to solve this problem:

\begin{enumerate*}
  \item designing new operating systems from scratch;
  \item refining commodity OS kernels.
\end{enumerate*}

Cerberus uses a middle ground between these two trends, taking advantage
of both solutions.

The first main idea used by Cerberus is an extension of traditional VMMs
with support for efficient resource sharing among the clustered
operating systems. This extension is needed due to the fact that
different OS may want to share some resources (address spaces, files), as
one application can run on several OS. However, traditional VMMs (like
Xen) were designed to separate as much as possible running VMs. Solution
to this problem used by Cerberus is an addition of address range support
to VMM, as well as an efficient distributed file system.

The second idea consists in incorporation of a system call
virtualization layer. This allows processes/threads of one application
to be executed in multiple operating systems, providing users with the
illusion of running in a single OS. This layer uses the notion of
``SuperProcess'', which groups processes/threads in multiple OS to
manage the spawned processes/threads.

\section{Overview of Cerberus}

As was mentioned before, one of the main contributions of Cerberus is the
modification of VMM to allow one application run on several VMs
simultaneously. This means that processes/threads belonging to one
shared-memory application now run on multiple operating systems. To
ensure system consistency in these conditions Cerberus uses the
following mechanisms.

\begin{itemize}
  \item Single Shared Memory Interface.

    Cerberus does not require modification of implemented parallel
    applications, and thus should provide to them the existing
    shared-memory interface (e.g. POSIX API). To address this issue,
    Cerberus incorporates a system-call virtualization layer which
    coordinates system calls in multiple clustered operating systems.
    All the processes/threads running on different OSes are logically
    grouped by a \emph{SuperProcess} entity.

  \item Efficient Resource Sharing.

   As traditional VMMs are designed to isolate VMs from each other, the
   other issue addressed by Cerberus was to allow an efficient sharing
   of resources (files, networking, etc.) among processes/threads. Thus,
   Cerberus implements a resource-sharing layer in both the VMM and
   operating systems.

	\section{Process spawning} 
	
	In order to split a process into several machines, Cerberus must spawn process creation request to the target VMs. Although, it is necessary to keep the semantics of those processes unified for sake of process management, this entity is called \emph{Super Process}.
	
	Spawning such processes require the parent OS to create a new process. The state of this process is saved in the shared memory area(called \emph{checkpoint}), at this point, the target OS is able to spawn a new process and restore the state from the shared memory area. 
	
	For a better performance, Cerberus benefits from parallel fork and parallel clone of processes, which allow create those processes simultaneously over different OSes.	

	After a process has been created, it is required to keep the memory in a consistent state, so the process can interact as if they were working in the same OS. Cerberus achieved that by propagating memory management system calls (e.g. mapping and unmapping operations) into the target OS(es).
	
	Cerberus is able to modify the behavior of the OS without changing the whole OS by intercepting the basic system calls, mainly the ones relates with memory management and IO. So the process sequence is system call interception, translation and dispatch to the target OS. %[4]
	
	After dispatching the creation of several processes across different OSes, Cerberus needs a mechanism to track the relation between each of those processes. Which leads Cerberus to globally map virtual processes id. %?? 
	
	In order to allow file sharing, basic security, page fault management and mechanism, some other informations must be shared, like: \emph{page table}, \emph{inode table}. 
	
	The entire communication among the VMs are done through message-passing. Thus, its needed a efficient way of communicating, and reducing as much as possible the network traffic, this is achieved by \emph{hierarchical massage-passing mechanism}, which selects the messages that must go through the physical NIC, all other messages are simple handled by the VM and transmitted directly to the other OS
	
	\section{Resource sharing}
	
	Multiprocess applications regularly share minimal amount of information, Thus this does not is not a concern. Although, multi-threaded applications share a large amount of memory segment (process table, file descriptors, etc) and require an efficient sharing scheme to allow those threads to access the resource without creating a contention. %Cerberus share this memory sector by maintaining a global list of \emph{shared address ranges}.
	
	%should talk about pagefault in here
	
	%Sharing efficiently the page table is a keypoint, to provided that the \emph{address range abstraction} is used to share the page table by using multiple levels, that are identified by the address range. The level of sharing which should be adopted is chosen by the size of the address range.
	
	%{talk about it subprocess} 
	
	Files could be shared among different OSes using Network File System, but this adds a lot of overhead due to its intrinsic centralized characteristics.
	
	The solution is to use a mixture between local and remote file access, when the OS needs to access a resource that is located in the current OS partition, the application can simply read the file directly, without requiring Cerberus intervention. Otherwise, the request access is handled by Cerberus File System Client.%[3] %[2]
	
	Since file access in such environment hides the complexity of having several OSes underneath, this creates a problem in a simple operation like listing, which requires to dispatch the listing information to every OS involved and wait for the results before marshals back to the user, but this is not considered critical since list operation is not common for high performance applications. %references for that?
	
	Creating virtual devices and scale this virtual devices with a set of real hardware is another possibility that allows to reduce the contention. This is done by using multiple hardwares and balancing the virtual device call through different physical devices. This is possible thanks to the VM environment. One example happens with Network Interface Cards, in which a single virtual can be mapped to severed physical devices. 
	
	\section{Prototyping}
	
	Cerberus was implemented based on Xen, re-implementing a subset of the POSIX calls. This gives support to other applications without having to re-implementing them. Applications for MapReduce, MemCached and benchmarking were deployed and used to obtain the results presented. Those application did not require any re-compilation to be supported by Cerberus.
		
	%[1]
	
	\subsection{Memory management}
	
	For an optimal memory access is necessary to share their page table. A scheme called Xen direct mode was implemented but would required more changes in the guest OS. Besides Xen direct was not as efficient as P2M.
	
	Before a VM handles system calls related to memory management, it will be forced to synchronize with the other VMs.
	
	\subsection{Cerberos file system}
	
	The control of the file system is kept based on local inodes, remote inodes and dentry. Local inode represents the file that can be accessed directly, which is located in the same VM domain of the requester. Remove inode receives the owner domain, doing so, the VM is capable to tell where should request this file from. Dentry is an organization inode, which can agregate several inodes.

	\subsection{System call virtualization}

	There are mainly two types of system calls: local and global state information access.
	
	The local state access does not offer any problem neither requires any intervention by Cerberus, but the ones that modify the global state of the system needs to be translated by Cerberus.
	
	Only 35 System calls from POSIX had to be implemented. In total 1800 lines of code were added to Xen, plus a kernel module of 8800 lines of code %35 of how many!?
	
	\section{Evaluation}
	
	To evaluate the communication round trip, ping pong was used, which send a message to a target application and expects to receive the same message, the time the message take to return is the round trip time, this is used to measure the performance of the application.
	
	By the ping pong is possible to check that virtualization adds some overhead to the mechanism, but even though it is faster send a signal across VMs then sending local signals. This happens for two reasons.
	\begin{itemize}
	\item The inter-VM messaging passing is efficient
	\item signaling the target process and execute the sender can be done in parallel
	\end{itemize}
	
	Here are some numbers that summarize the result of this paper:
	\begin{itemize}
	\item Creating process (clone or fork) takes about 11 $\mu S$ %the author considers acceptable, why!?
	\item Local file access, reading first 10 bytes of a file using CFS takes 6.52 $mu S$, compared with 6.47 $mu S$ using pure Xen-linux
	\item Remote file access, reading can take up to 17.81 $\mu S$
	\item Sending and Receiving packages through Cerberus network system was 25\% faster than the native linux system
	\item Cerberus performed worst than native system when submitted to a low number of cores, but the performance increases dramatically as the number of core grows (in order of 37\%).
	\end{itemize}

\end{itemize}

%Problems:
%-author does not show the performance of CFS in writing, and seeking a file
%-sometimes redundant distributes  replicated kernels
%-centralized load balance
%-acceptable performance
%-code not available

%informations to gather
%-how many lines of code the actual kernel has?
%-how many system calls exists in POSIX?

%Absence:
%talk about crash recovery model? [1]
%what is acceptable performance file sharing??? [2]
%input the value of how slow is NFS compared with other filesystems? [3]
%marshals before returning, how to say that? [4]

\end{document}

